We review concepts, principles, and tools that unify current approaches
to causal analysis, and attend to new challenges presented by big data.
In particular, we address the problem of data-fusion: piecing together
multiple datasets collected under heterogeneous conditions (i.e.,
different populations, regimes, and sampling methods) so as to obtain
valid answers to queries of interest. The availability of multiple
heterogeneous datasets presents new opportunities, since the knowledge
that can be acquired from combined data would not be possible from any
individual source alone. However, the biases that emerge in
heterogeneous environments require new analytical tools. Some of these
biases, including confounding, sampling selection, and cross-population
biases, have been addressed in isolation, largely in restricted models.
We here present a general, non-parametric framework for handling these
biases and, ultimately, a theoretical solution to the problem of
data-fusion in causal and counterfactual inference.

causal inference | counterfactuals | big data | confounding | external validity |
meta-analysis | heterogeneity | selection bias | data integration

Introduction  -  Causal  Inference  and  Big  Data 

The exponential growth of electronically accessible information has led
some to conjecture that data alone can replace scientific knowledge in
practical decision making. In this paper, we argue that the
hypothetico-deductive paradigm that has been successful in the natural
and bio-medical sciences is still necessary for big data applications,
albeit augmented with new challenges. These challenges turn into
opportunities when viewed through the prism of causal inference, a field
that has scored major advances in the past two decades by borrowing two
methodological principles from the hypothetico-deductive paradigm:
first, a commitment to understanding what reality must be like for a
statistical routine to succeed and, second, a commitment to represent
reality in terms of data-generating models, rather than distributions of
observed variables.

Encoded as non-parametric structural equations, these models have led
to a fruitful symbiosis between graphs and counterfactuals and have
unified the potential outcome framework of Neyman, Rubin, and Robins
with the econometric tradition of Haavelmo, Marschak, and Heckman. In
this symbiosis, counterfactuals (or potential outcomes) emerge as
natural byproducts of structural equations and serve to formally
articulate  research  questions  of  interest.  Graphical  models,  on 
the  other  hand,  are  used  to  encode  scientific  assumptions  in  a 
qualitative  (i.e.,  non-parametric)  and  transparent  language  as 
well  as  to  derive  the  logical  ramifications  of  these  assumptions, 
in  particular,  their  testable  implications  and  how  they  shape 
behavior  under  interventions. 

In this paper, we build on the semantic clarity that arises from this
unification and give precise meaning to fundamental concepts such as
units, populations, models, experimental conditions, sampling selection,
counterfactuals, causal effects, assignment mechanism, and more. These
concepts will be defined formally in the next section as part of the
Structural Causal Model (SCM) framework.

One unique feature of the SCM framework, essential in big data
applications, is the ability to encode mathematically the method by
which data are acquired, often referred to generically as the “design.”
This sensitivity to design, which we can label proverbially as “not all
data are created equal,” is illustrated schematically through a series
of scenarios depicted in Fig. 1. Each design (shown at the bottom of the
figure) represents a triplet specifying the population, the regime
(observational versus experimental), and the sampling method by which
each dataset is generated. This formal encoding will allow us to
delineate the inferences that one can draw from each design to answer
the query of interest (shown at the top).

Consider the task of predicting the distribution of outcomes Y after
intervening on a variable X, written $Q = P(Y = y \mid do(X = x))$.
Assume that the information available to us comes from an observational
study, in which X, Y, Z, and W are measured, and samples are selected at
random. We ask for conditions under which the query Q can be inferred
from the information available, which takes the form $P(y, x, z, w)$,
where Z and W are sets of observed covariates. This represents the
standard task of policy evaluation, where controlling for confounding
bias is the major issue (Task 1, Fig. 1).

Consider now Task 2 in Fig. 1, in which the goal is again to estimate
the effect of the intervention $do(X = x)$, but the data available to
the investigator were collected in an experimental study in which
variable Z, more accessible to manipulation than X, is randomized.
(Instrumental variables are special cases of this task.) The general
question in this scenario is under what conditions randomization of
variable Z can be used to infer how the population would react to
interventions over X. Formally, our problem is to infer
$P(Y = y \mid do(X = x))$ from $P(y, x, w \mid do(Z = z))$. A complete
solution to these two problems will be presented in the respective
policy evaluation section.

In each of the two previous tasks we assumed that a perfect random
sample from the underlying population was drawn, which may not always
be realizable. Task 3 in Fig. 1 represents a randomized clinical trial
conducted on a non-representative sample of the population. Here, the
information available takes the syntactic form
$P(y, z, w \mid do(X = x), S = 1)$, and possibly
$P(y, x, z, w \mid S = 1)$, where S is a sample selection indicator,
with S = 1 indicating inclusion in the sample. The challenge is to
estimate the effect of interest from this far-from-ideal sampling
condition. Formally, we ask when the target quantity
$P(y \mid do(X = x))$ is derivable from the available information (i.e.,
sampling-biased distributions). The section on sample selection bias
will present a solution to this problem.

Finally, the previous examples assumed that the population from which
data were collected is the same as the one for which inference was
intended. This is often not the case (Task 4 in Fig. 1). For example,
biological experiments often use animals as substitutes for human
subjects. Or, in a less obvious example, data may be available from an
experimental study that took place several years ago, and the current
population has changed in a set S of (possibly unmeasured) attributes.
Our task then is to infer the causal effect at the target population,
$P(y \mid do(X = x), S = s)$, from the information available, which now
takes the form $P(y, z, w \mid do(X = x))$ and
$P(y, x, z, w \mid S = s)$. The second expression represents information
obtainable from non-experimental studies on the current population,
where S = s.

The problems represented in these archetypal examples are known as
confounding bias (Tasks 1, 2), sample selection bias (Task 3), and
transportability bias (Task 4). The information available in each of
these tasks is characterized by a different syntactic form, representing
a different “design” and, naturally, each of these designs should lead
to different inferences.
What  we  shall  see  in  subsequent  sections  of  this  paper  is  that 
the  strategy  of  going  from  design  to  a  query  is  the  same  across 
tasks;  it  follows  simple  rules  of  inference  and  decides,  using 
syntactic  manipulations,  whether  the  type  of  data  available  is 
sufficient  for  the  task,  and,  if  so,  how. 

Empowered by this strategy, the central goal of this paper will be to
explicate the conditions under which causal effects can be estimated
non-parametrically from multiple heterogeneous datasets. These
conditions constitute the formal basis for many big data inferences
since, in practice, data are never collected under idealized conditions,
ready for use. The remainder of the paper is organized as follows. We
start by defining structural causal models (SCMs) and stating the two
fundamental laws of causal inference. We then consider, respectively,
the problem of policy evaluation in observational and experimental
settings, sampling selection bias, and data-fusion from multiple
populations.


Counterfactuals  and  SCM 

At the center of the structural theory of causation lies a “structural
model,” M, consisting of two sets of variables, U and V, and a set F of
functions that determine or simulate how values are assigned to each
variable $V_i \in V$. Thus, for example, the equation

$$v_i = f_i(v, u)$$

describes a physical process by which variable $V_i$ is assigned the
value $v_i = f_i(v, u)$ in response to the current values, v and u, of
all variables in V and U. Formally, the triplet
$\langle U, V, F \rangle$ defines an SCM, and the diagram that captures
the relationships among the variables is called the causal graph G (of
M). The variables in U are considered “exogenous,” namely, background
conditions for which no explanatory mechanism is encoded in model M.
Every instantiation U = u of the exogenous variables uniquely determines
the values of all variables in V and, hence, if we assign a probability
$P(u)$ to U, it induces a probability function $P(v)$ on V. The vector
U = u can also be interpreted as an experimental “unit” which can stand
for an individual subject, agricultural lot, or time of day.
Conceptually, a unit U = u should be thought of as the sum total of all
relevant factors that govern the behavior of an individual or
experimental circumstances.


Fig. 1. Prototypical counterfactual inferences where the goal is, for
example, to estimate the experimental distribution in a target
population (shown at the top). Let $V = \{X, Y, Z, W\}$. There are
different designs (bottom) showing that data come from non-idealized
conditions, specifically: (1) from the same population under an
observational regime, $P(v)$; (2) from the same population under an
experimental regime when Z is randomized, $P(v \mid do(z))$; (3) from
the same population under sampling selection bias, $P(v \mid S = 1)$ or
$P(v \mid do(x), S = 1)$; (4) from a different population that is
submitted to an experimental regime when X is randomized,
$P(v \mid do(x), S = s)$, and observational studies in the target
population.

The basic counterfactual entity in structural models is the sentence:
“Y would be y had X been x in unit (or situation) U = u,” denoted
$Y_x(u) = y$. Letting $M_x$ stand for a modified version of M, with the
equation(s) of set X replaced by X = x, the formal definition of the
counterfactual $Y_x(u)$ reads

$$Y_x(u) = Y_{M_x}(u). \tag{1}$$

In words, the counterfactual $Y_x(u)$ in model M is defined as the
solution for Y in the “modified” submodel $M_x$. Refs. [1] and [2] have
given a complete axiomatization of structural counterfactuals, embracing
both recursive and non-recursive models (see also [3, Chapter 7]).
Remarkably, the axioms that characterize counterfactuals in SCM coincide
with those that govern potential outcomes in Rubin’s causal model [5],
where $Y_x(u)$ stands for the potential outcome of unit u, had u been
assigned treatment X = x. This axiomatic agreement implies a logical
equivalence of the two systems, namely, any valid inference in one is
also valid in the other. The advantage of SCMs lies in their
transparency and representational effectiveness [6].

Eq. (1) implies that the distribution $P(u)$ induces a well-defined
probability on the counterfactual event $Y_x = y$, written
$P(Y_x = y)$, which is equal to the probability that a random unit u
would satisfy the equation $Y_x(u) = y$. By the same reasoning, the
model $\langle U, V, F, P(u) \rangle$ assigns a probability to every
counterfactual or combination of counterfactuals defined on the
variables in V.
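As an illustration of Eq. (1), the following minimal sketch solves a small binary model for each unit u and computes probabilities by summing $P(u)$ over the units that satisfy an event. The model itself (X := U1, Y := X XOR U2, and the numbers in `P_U`) is a hypothetical example chosen for this sketch and does not come from the paper.

```python
# Hypothetical binary SCM (an assumption for illustration only):
# exogenous U = (U1, U2), endogenous V = (X, Y), X := U1, Y := X XOR U2.
P_U = {(u1, u2): p1 * p2
       for (u1, p1) in [(0, 0.6), (1, 0.4)]
       for (u2, p2) in [(0, 0.7), (1, 0.3)]}

F = {
    "X": lambda v, u: u[0],
    "Y": lambda v, u: v["X"] ^ u[1],
}
ORDER = ["X", "Y"]  # a topological order of this recursive model


def solve(u, do=None):
    """Solve the model for unit u; `do` replaces equations (submodel M_x)."""
    do = do or {}
    v = {}
    for name in ORDER:
        v[name] = do[name] if name in do else F[name](v, u)
    return v


def prob(event, do=None):
    """Probability, induced by P(u), that the solved model satisfies `event`."""
    return sum(p for u, p in P_U.items() if event(solve(u, do)))


p_y1 = prob(lambda v: v["Y"] == 1)                     # P(Y = 1)
p_y1_do_x1 = prob(lambda v: v["Y"] == 1, do={"X": 1})  # P(Y_{x=1} = 1)
```

Here `solve(u, do={"X": 1})` plays the role of the submodel $M_x$: the equation for X is replaced by a constant while every other equation is left untouched.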

The two principles of causal inference. Before describing how the
structural theory applies to big data inferences, it will be useful to
summarize its implications in the form of two “principles,” from which
all other results follow.

Principle  1:  “The  law  of  structural  counterfactuals.” 

Principle  2:  “The  law  of  structural  independences.” 


¹ The structural definition of counterfactual given in Eq. (1) was first introduced in [4].

² By a path we mean a sequence of consecutive edges in the graph regardless of direction. Dependencies among
the U variables are represented by double-arrowed arcs, as in Fig. 3 below.


2  j  www.pnas.org/cgi/doi/10.1073/pnas.0709640104 
The first principle is described in Eq. (1) and instructs us how to
compute counterfactuals and probabilities of counterfactuals from a
structural model. This instruction constitutes the semantics of
counterfactuals, which determines how counterfactual variables are
related to observed variables and, conversely, how the observed data
influence the causal parameters that we aim to estimate.

Principle 2 defines how structural features of the model entail
constraints (e.g., conditional independencies) readable in the data.
Remarkably, regardless of the functional form of the equations in the
model (F) and regardless of the distribution of the exogenous variables
(U), if the model is recursive, the distribution $P(v)$ of the
endogenous variables must obey certain conditional independence
relations, stated roughly as follows: whenever sets X and Y are
separated by a set Z in the graph, X is independent of Y given Z in the
probability distribution. This “separation” condition, called
d-separation [7, pp. 16-18], constitutes the link between the causal
assumptions encoded in the graph (in the form of missing arrows) and the
observed data.

Definition 1. (d-separation)

A set Z of nodes is said to block a path p if either

1. p contains at least one arrow-emitting node that is in Z
(i.e., a chain $\to Z \to$ or a fork $\leftarrow Z \to$), or

2. p contains at least one collision node that is outside Z
(i.e., a node with converging arrows, $\to \cdot \leftarrow$) and has
no descendant in Z.

If Z blocks all paths from set X to set Y, it is said to “d-separate X
and Y,” and then variables X and Y are independent given Z, written
$X \perp\!\!\!\perp Y \mid Z$.²

D-separation implies conditional independencies for every distribution
$P(v)$ that can be generated by assigning functions (F) to the variables
in the graph. To illustrate, the diagram in Fig. 2(a) implies
$Z_1 \perp\!\!\!\perp Y \mid \{X, Z_3, W_2\}$, because the conditioning
set $Z = \{X, Z_3, W_2\}$ blocks all paths between $Z_1$ and Y. The set
$Z = \{X, Z_3, W_3\}$, however, leaves the path
$(Z_1 \to Z_3 \leftarrow Z_2 \to W_2 \to Y)$ unblocked (by virtue of the
converging arrows (collider) at $Z_3$) and, so, the independence
$Z_1 \perp\!\!\!\perp Y \mid \{X, Z_3, W_3\}$ is not implied by the
diagram.
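The two d-separation claims above can be checked mechanically. The sketch below uses the standard moralized-ancestral-graph test rather than enumerating paths one by one as in Definition 1; the two formulations are equivalent for DAGs. The edge list for Fig. 2(a) is an assumption, reconstructed from the paths and arrows named in the text, since the figure itself is not reproduced here.

```python
from collections import deque

# Edge list for Fig. 2(a), reconstructed from the text (an assumption).
EDGES = [("Z1", "W1"), ("Z1", "Z3"), ("Z2", "Z3"), ("Z2", "W2"),
         ("W1", "X"), ("Z3", "X"), ("Z3", "Y"),
         ("X", "W3"), ("W3", "Y"), ("W2", "Y")]


def parents_of(edges):
    """Map every node to the set of its parents."""
    pa = {}
    for a, b in edges:
        pa.setdefault(a, set())
        pa.setdefault(b, set()).add(a)
    return pa


def d_separated(edges, xs, ys, zs):
    """True if zs d-separates xs from ys in the DAG given by `edges`."""
    pa = parents_of(edges)
    # 1. Keep only xs, ys, zs and their ancestors.
    keep, stack = set(), list(set(xs) | set(ys) | set(zs))
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(pa.get(n, ()))
    # 2. Moralize: link each node to its parents, and co-parents to each other.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(pa.get(child, ()))
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # 3. Delete zs and test whether any x can still reach any y.
    blocked = set(zs)
    seen = set(xs) - blocked
    queue = deque(seen)
    while queue:
        n = queue.popleft()
        if n in ys:
            return False
        for m in nbrs[n] - blocked - seen:
            seen.add(m)
            queue.append(m)
    return True


blocked_by_w2 = d_separated(EDGES, {"Z1"}, {"Y"}, {"X", "Z3", "W2"})
blocked_by_w3 = d_separated(EDGES, {"Z1"}, {"Y"}, {"X", "Z3", "W3"})
```

Under the assumed edge list, `blocked_by_w2` is true and `blocked_by_w3` is false, matching the two statements in the text.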

In the sequel, we show how these independencies help us evaluate the
effect of interventions and overcome the problem of confounding bias.³
Clearly, any attempt to predict the effects of interventions from
non-experimental data must rely on causal assumptions. One of the most
attractive features of the SCM framework is that those assumptions are
all encoded parsimoniously in the diagram; thus, unlike
“ignorability”-type assumptions [8, 9], they can be meaningfully
scrutinized for scientific plausibility or be submitted to statistical
tests.


Policy  Evaluation  and  the  Problem  of  Confounding 

A central question in causal analysis is that of predicting the results
of interventions, such as those resulting from medical treatments or
social programs, which we denote by the symbol $do(x)$ and define using
the counterfactual $Y_x$ as⁴

$$P(y \mid do(x)) = P(Y_x = y). \tag{2}$$

Figure 2(b) illustrates the submodel $M_x$ created by the atomic
intervention $do(x)$; it sets the value of X to x and thus removes the
influence (arrows) of $\{W_1, Z_3\}$ on X. The set of incoming arrows
towards X is sometimes called the assignment mechanism, and may also
represent how the decision X = x is made by an individual in response to
natural predilections (i.e., $\{W_1, Z_3\}$), as opposed to an
externally imposed assignment in a controlled experiment.⁵ Furthermore,
we can similarly define the result of stratum-specific interventions by

$$P(y \mid do(x), z) = \frac{P(y, z \mid do(x))}{P(z \mid do(x))} = P(Y_x = y \mid Z_x = z). \tag{3}$$

$P(y \mid do(x), z)$ captures the z-specific effect of X on Y, that is,
Y’s response to setting X to x among those units only for which Z
responds with z. (For pre-treatment Z (e.g., sex, age, or ethnicity),
those units would remain invariant to X (i.e., $Z_x = Z$).)

Recalling that any counterfactual quantity can be computed from a fully
specified model $\langle U, V, F, P(u) \rangle$, it follows that the
interventional distributions defined in Eq. (2) and Eq. (3) can be
computed directly from such a model. In practice, however, only a
partially specified model is available, in the form of a graph G, and
the question arises whether the data collected can make up for our
ignorance of the functions F and the probabilities $P(u)$. This is the
problem of identification, which asks whether the interventional
distribution, $P(y \mid do(x))$, can be estimated from the available
data and the assumptions embodied in the causal graph.

In parametric settings, the question of identification reduces to asking
whether some model parameter, $\theta$, has a unique solution in terms
of the parameters of P. In the nonparametric formulation, the notion of
“has a unique solution” does not directly apply since quantities such as
$Q = P(y \mid do(x))$ have no parametric signature and are defined
procedurally by a symbolic operation on the causal model M (as in Fig.
2(b)). The following definition captures the requirement that Q be
estimable from the observed distribution and the causal graph:

Definition 2. (Identifiability) [7, p. 77]

A causal query Q is identifiable from distribution $P(v)$ compatible
with a causal graph G, if for any two (fully specified) models $M_1$ and
$M_2$ that satisfy the assumptions in G, we have

$$P_1(v) = P_2(v) \Rightarrow Q(M_1) = Q(M_2). \tag{4}$$

In words, equality in the probabilities $P_1(v)$ and $P_2(v)$ induced by
models $M_1$ and $M_2$, respectively, entails equality in the answers
that these two models give to query Q. When this happens, Q depends on
$P(v)$ and G only, and can therefore be expressed in terms of the
parameters of $P(v)$ (i.e., regardless of the true underlying mechanisms
F and randomness $P(u)$).
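Definition 2 is perhaps easiest to appreciate through a failure case. The sketch below builds two hypothetical models, both compatible with the graph X ← U → Y, X → Y in which U is unobserved, that induce the same observational distribution $P(x, y)$ yet disagree on $Q = P(Y = 1 \mid do(X = 1))$. The models and numbers are assumptions made for this illustration, not taken from the paper.

```python
# Two hypothetical models over the graph X <- U -> Y, X -> Y (U unobserved).
P_U = {0: 0.5, 1: 0.5}

M1 = {"X": lambda u: u, "Y": lambda x, u: x}  # Y listens to X only
M2 = {"X": lambda u: u, "Y": lambda x, u: u}  # Y listens to U only


def observational(model):
    """Distribution P(x, y) induced by P(u) when no intervention is made."""
    dist = {}
    for u, p in P_U.items():
        x = model["X"](u)
        y = model["Y"](x, u)
        dist[(x, y)] = dist.get((x, y), 0.0) + p
    return dist


def query(model, x=1, y=1):
    """Q = P(Y = y | do(X = x)): X is held at x, Y's equation is unchanged."""
    return sum(p for u, p in P_U.items() if model["Y"](x, u) == y)


same_data = observational(M1) == observational(M2)
q1, q2 = query(M1), query(M2)
```

Both models generate identical data (`same_data` is true), but `q1` is 1.0 while `q2` is 0.5, so Eq. (4) fails and Q is not identifiable from $P(x, y)$ and this graph.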

For queries in the form of a do-expression, for example
$Q = P(y \mid do(x), z)$, identifiability can be decided systematically
using an algebraic procedure known as the do-calculus [11], to be
discussed next. It consists of three inference rules that permit us to
manipulate interventional and observational distributions whenever
certain separation conditions hold in the causal diagram G.

The rules of do-calculus. Let X, Y, Z, and W be arbitrary disjoint sets
of nodes in a causal DAG G. We denote by $G_{\overline{X}}$ the graph
obtained by deleting from G all arrows pointing to nodes in X (e.g.,
Fig. 2(b)). Likewise, we denote by $G_{\underline{X}}$ the graph
obtained by deleting from G all arrows emerging from nodes in X (e.g.,
Fig. 2(c)). To represent the deletion of both incoming and outgoing
arrows, we use the notation $G_{\overline{X}\underline{Z}}$.

The following three rules are valid for every interventional
distribution compatible with G.

Rule 1 (Insertion/deletion of observations):

$$P(y \mid do(x), z, w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}} \tag{5}$$

Rule 2 (Action/observation exchange):

$$P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}} \tag{6}$$

Rule 3 (Insertion/deletion of actions):

$$P(y \mid do(x), do(z), w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z^*}}} \tag{7}$$

where $Z^*$ is the set of Z-nodes that are not ancestors of any W-node
in $G_{\overline{X}}$.

³ These and other constraints implied by Principle 1 also facilitate model testing and learning [7].

⁴ Alternative definitions of do(x) invoking population averages only are given in [3, p. 24] and [10],
which are also compatible with the results presented in this paper.

⁵ This primitive operator can be used for handling stratum-specific interventions [3, Ch. 4] as well
as to model issues of non-compliance [3, Ch. 8] and compound interventions [3, Ch. 11.4].

To establish identifiability of a causal query Q, one needs to
repeatedly apply the rules of do-calculus to Q, until an expression is
obtained which no longer contains a do-operator⁶; this renders it
estimable from nonexperimental data. The do-calculus was proven to be
complete for queries in the form $Q = P(y \mid do(x), z)$ [12, 13],
which means that if Q cannot be reduced to probabilities of observables
by repeated application of these three rules, Q is not identifiable. We
show next concrete examples of the application of the do-calculus.

Covariate  selection:  the  back-door  criterion 

Consider an observational study where we wish to find the effect of
treatment (X) on outcome (Y), and assume that the factors deemed
relevant to the problem are structured as in Fig. 2(a); some are
affecting the outcome, some are affecting the treatment, and some are
affecting both treatment and response. Some of these factors may be
unmeasurable, such as genetic traits or lifestyle, while others are
measurable, such as gender, age, and salary level. Our problem is to
select a subset of these factors for measurement and adjustment such
that if we compare treated vs. untreated subjects having the same values
of the selected factors, we get the correct treatment effect in that
subpopulation of subjects. Such a set of factors is called a “sufficient
set,” “admissible set,” or a set “appropriate for adjustment” (see [14,
6]). The following criterion, named “back-door” [15], provides a
graphical method of selecting such a set of factors for adjustment.

Definition 3. (admissible sets: the back-door criterion)

A set Z is admissible (or “sufficient”) for estimating the causal
effect of X on Y if two conditions hold:

1. No element of Z is a descendant of X.

2. The elements of Z “block” all “back-door” paths from X to Y, i.e.,
all paths that end with an arrow pointing to X.

Based on this criterion we see, for example, that in Fig. 2 the sets
$\{Z_1, Z_2, Z_3\}$, $\{Z_1, Z_3\}$, $\{W_1, Z_3\}$, and
$\{W_2, Z_3\}$ are each sufficient for adjustment, because each blocks
all back-door paths between X and Y. The set $\{Z_3\}$, however, is not
sufficient for adjustment because it does not block the path
$X \leftarrow W_1 \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to W_2 \to Y$.
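Definition 3 lends itself to a direct check. A minimal sketch follows, assuming the same reconstructed edge list for Fig. 2(a) (the figure is not reproduced here) and testing Condition 2 as d-separation of X and Y in the graph with the arrows leaving X removed; the helper names are hypothetical.

```python
# Edge list for Fig. 2(a), reconstructed from the text (an assumption).
EDGES = [("Z1", "W1"), ("Z1", "Z3"), ("Z2", "Z3"), ("Z2", "W2"),
         ("W1", "X"), ("Z3", "X"), ("Z3", "Y"),
         ("X", "W3"), ("W3", "Y"), ("W2", "Y")]


def parents(edges, node):
    return {a for a, b in edges if b == node}


def descendants(edges, node):
    found, stack = set(), [node]
    while stack:
        n = stack.pop()
        for a, b in edges:
            if a == n and b not in found:
                found.add(b)
                stack.append(b)
    return found


def d_separated(edges, xs, ys, zs):
    """Moralized-ancestral-graph test: True if zs d-separates xs from ys."""
    keep, stack = set(), list(xs | ys | zs)
    while stack:                      # ancestral closure of xs, ys, zs
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents(edges, n))
    links = set()
    for child in keep:                # moralize and drop directions
        ps = sorted(parents(edges, child))
        links |= {frozenset((p, child)) for p in ps}
        links |= {frozenset((p, q)) for p in ps for q in ps if p < q}
    reach, frontier = set(xs - zs), list(xs - zs)
    while frontier:                   # search that never enters zs
        n = frontier.pop()
        for link in links:
            if n in link:
                (m,) = link - {n}
                if m not in zs and m not in reach:
                    reach.add(m)
                    frontier.append(m)
    return not (reach & ys)


def backdoor_admissible(edges, x, y, zs):
    """Definition 3: zs has no descendant of x and blocks all back-door paths."""
    if zs & descendants(edges, x):
        return False
    no_outgoing = [(a, b) for a, b in edges if a != x]  # arrows leaving x removed
    return d_separated(no_outgoing, {x}, {y}, zs)


admissible = [s for s in ({"Z1", "Z2", "Z3"}, {"Z1", "Z3"}, {"W1", "Z3"},
                          {"W2", "Z3"}, {"Z3"})
              if backdoor_admissible(EDGES, "X", "Y", s)]
```

Under the assumed edge list, the first four candidate sets pass and $\{Z_3\}$ fails, in agreement with the text.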

The intuition behind the back-door criterion is simple. The back-door
paths in the diagram carry the “spurious associations” from X to Y,
while the paths directed along the arrows from X to Y carry causative
associations. If we remove the latter paths as shown in Fig. 2(c),
checking whether X and Y are separated by Z amounts to verifying that Z
blocks all spurious paths. This ensures that the measured association
between X and Y is purely causal, namely, it correctly represents the
causal effect of X on Y. Conditions for relaxing restriction 1 are given
in [3, p. 338] and [16, 17].⁷

The implication of finding a sufficient set, Z, is that stratifying on
Z is guaranteed to remove all confounding bias relative to the causal
effect of X on Y. In other words, it renders the effect of X on Y
identifiable, via the adjustment formula⁸

$$P(Y = y \mid do(X = x)) = \sum_z P(y \mid x, Z = z)\, P(Z = z). \tag{8}$$


Fig. 2. (a) Graphical model illustrating d-separation and the back-door
criterion. U terms are not shown explicitly. (b) Illustrating the
intervention do(X = x) with arrows towards X cut. (c) Illustrating the
spurious paths, which pop out when we cut the outgoing edges from X, and
need to be blocked if one wants to use adjustment.

Since all factors on the right-hand side of the equation are estimable
(e.g., by regression) from non-experimental data, the causal effect can
likewise be estimated from such data without bias. Eq. (8) differs from
the conditional distribution of Y given X, which can be written as

$$P(Y = y \mid X = x) = \sum_z P(y \mid x, Z = z)\, P(Z = z \mid x); \tag{9}$$

the difference between these two distributions defines confounding bias.
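The gap between Eq. (8) and Eq. (9) can be made concrete with a small discrete example. The sketch below assumes a single binary confounder Z with hypothetical probabilities (none of these numbers come from the paper) and evaluates both formulas exactly.

```python
# Hypothetical binary model with one confounder Z (Z -> X, Z -> Y, X -> Y).
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}  # P(X = 1 | Z = z)


def p_x_given_z(x, z):
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def p_y1_given_xz(x, z):
    """P(Y = 1 | X = x, Z = z)."""
    return 0.1 + 0.3 * x + 0.5 * z


def adjusted(x):
    """Eq. (8): sum_z P(y | x, z) P(z), the interventional quantity."""
    return sum(p_y1_given_xz(x, z) * P_Z[z] for z in P_Z)


def conditional(x):
    """Eq. (9): sum_z P(y | x, z) P(z | x), the observational quantity."""
    p_x = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return sum(p_y1_given_xz(x, z) * p_x_given_z(x, z) * P_Z[z] / p_x
               for z in P_Z)


bias = conditional(1) - adjusted(1)
```

In this example `adjusted(1)` evaluates to 0.65 and `conditional(1)` to 0.80, so the confounding bias is 0.15: units with Z = 1 are over-represented among the treated, and Z also raises Y.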

Moreover, the back-door criterion implies an independence known as
“conditional ignorability” [8], $X \perp\!\!\!\perp Y_x \mid Z$, and
provides therefore the scientific basis for most inferences in the
potential outcome framework. For example, the set of covariates that
enter “propensity score” analysis [8] must constitute a back-door
sufficient set, else confounding bias will arise.

The back-door criterion can be applied systematically to diagrams of
any size and shape, thus freeing analysts from judging whether “X is
conditionally ignorable given Z,” a formidable mental task required in
the potential-outcome framework. The criterion also enables the analyst
to search for an optimal set of covariates, namely, a set Z that
minimizes measurement cost or sampling variability [18, 19].

Despite its importance, adjustment for covariates (or for propensity
scores) is only one tool available for estimating the effects of
interventions in observational studies; more refined strategies exist
which go beyond adjustment. For instance, assume that only variables
$\{X, Y, W_3\}$ are observed in Fig. 2(a), so only the observational
distribution $P(x, y, w_3)$ may be estimated from the samples. In this
case, conditional ignorability does not hold, but an alternative
strategy known as the front-door criterion [7, p. 83] can be employed to
yield identification. Specifically, the calculus permits rewriting the
experimental distribution as

$$P(Y = y \mid do(X = x)) = \sum_{w_3} P(w_3 \mid x) \sum_{x'} P(y \mid x', w_3)\, P(x'), \tag{10}$$

which is almost always different from Eq. (8).
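The front-door expression in Eq. (10) can be verified numerically on a small model. The sketch below assumes a hypothetical binary model with an unobserved confounder U of X and Y and a mediator W (playing the role of $W_3$); it computes the observed joint $P(x, w, y)$ exactly, applies the front-door formula using observed quantities only, and compares the result with the interventional probability obtained from the full model. All parameter values are assumptions.

```python
from itertools import product

# Hypothetical model: U -> X, U -> Y (U unobserved), X -> W -> Y.
P_U = {0: 0.5, 1: 0.5}


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def p_x(x, u):
    return bern(0.2 + 0.6 * u, x)


def p_w(w, x):
    return bern(0.1 + 0.7 * x, w)


def p_y(y, w, u):
    return bern(0.1 + 0.4 * w + 0.4 * u, y)


# Observed joint P(x, w, y), with U summed out.
JOINT = {(x, w, y): sum(P_U[u] * p_x(x, u) * p_w(w, x) * p_y(y, w, u)
                        for u in P_U)
         for x, w, y in product((0, 1), repeat=3)}


def marg(**fixed):
    """Marginal probability of the assignments in `fixed` (keys x, w, y)."""
    return sum(p for (x, w, y), p in JOINT.items()
               if all({"x": x, "w": w, "y": y}[k] == v
                      for k, v in fixed.items()))


def front_door(y, x):
    """Eq. (10), evaluated from the observed joint only."""
    total = 0.0
    for w in (0, 1):
        p_w_given_x = marg(x=x, w=w) / marg(x=x)
        inner = sum(marg(x=x2, w=w, y=y) / marg(x=x2, w=w) * marg(x=x2)
                    for x2 in (0, 1))
        total += p_w_given_x * inner
    return total


def truth(y, x):
    """P(y | do(x)) computed from the full model, including U."""
    return sum(P_U[u] * p_w(w, x) * p_y(y, w, u)
               for u in P_U for w in (0, 1))
```

For this model `front_door(1, 1)` and `truth(1, 1)` agree (both about 0.62), whereas the plain conditional `marg(x=1, y=1) / marg(x=1)` is about 0.74.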

Finally, in case $W_3$ is also not observed, only the observational
distribution $P(x, y)$ can be estimated from the samples, and the
calculus will discover that no reduction is feasible, which implies (by
virtue of its completeness) that the target quantity is not identifiable
(without further assumptions).

⁶ Such derivations are illustrated in graphical detail in [3, p. 87] and in the next section.

⁷ In particular, the criterion devised by [17] simply adds to Condition 2 of Definition 3 the requirement
that X and its nondescendants (in Z) separate its descendants (in Z) from Y.

⁸ Summations should be replaced by integration when applied to continuous variables.

Identification through Auxiliary Experiments

In many applications, it is not uncommon that the quantity
$Q = P(y \mid do(x))$ is not identifiable from the observational data
alone. Imagine a researcher interested in assessing the effect (Q) of
cholesterol levels (X) on heart disease (Y), assuming data about
subjects’ diet (Z) is also collected (Fig. 3(a)). In practice, it is
infeasible to control subjects’ cholesterol level by intervention, so
$P(y \mid do(x))$ cannot be obtained from a randomized trial. Assuming,
however, that an experiment can be conducted in which Z is randomized,
would Q be computable given this additional piece of experimental
information?

This question represents what we called Task 2 in Fig. 1, and leads to
a natural extension of the identifiability problem (Definition 2) in
which, in addition to the standard input ($P(v)$ and G), an
interventional distribution $P(v \mid do(z))$ is also available to help
establish $Q = P(y \mid do(x))$. This task can be seen as the
non-parametric version of identification with instrumental variables and
was named z-identification in [20].

Using the do-calculus and the assumptions embedded in Fig. 3(a),
it can readily be shown that the target query Q can be
transformed to read:

\[ P(Y = y \mid do(X = x)) = \frac{P(y, x \mid do(z))}{P(x \mid do(z))} \tag{11} \]

for any level Z = z. Since all do-terms in Eq. (11) apply only to
Z, Q is estimable from the available data. In general, it can be
shown [20] that z-identifiability is feasible if and only if X
intercepts all directed paths from Z to Y and P(y | do(x)) is
identifiable in G_{\overline{Z}}, the graph obtained from G by
removing all arrows pointing into Z.
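
Eq. (11) can be checked numerically on a small model. The sketch below is only an illustration under stated assumptions: the binary model (latent U1 affecting Z and X, latent U2 affecting Z and Y, and Z -> X -> Y) and all probabilities are hypothetical choices that satisfy the two conditions above, not a transcription of Fig. 3(a). It compares the ratio in Eq. (11), which uses only the Z-randomized distribution, with the effect computed from the latent variable.

```python
from itertools import product

# Hypothetical model (assumption): U1 -> Z, U1 -> X, U2 -> Z, U2 -> Y, Z -> X -> Y.
P_U1 = {0: 0.5, 1: 0.5}
P_U2 = {0: 0.7, 1: 0.3}


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def p_x1(z, u1):
    """P(X = 1 | z, u1), an illustrative mechanism."""
    return 0.2 + 0.5 * z + 0.2 * u1


def p_y1(x, u2):
    """P(Y = 1 | x, u2), an illustrative mechanism."""
    return 0.1 + 0.4 * x + 0.3 * u2


def true_effect(y, x):
    """P(y | do(x)), computed with access to the latent U2."""
    return sum(P_U2[u2] * bern(p_y1(x, u2), y) for u2 in (0, 1))


def joint_do_z(x, y, z):
    """P(x, y | do(z)): Z is set externally, so U1 and U2 keep their priors."""
    return sum(
        P_U1[u1] * P_U2[u2] * bern(p_x1(z, u1), x) * bern(p_y1(x, u2), y)
        for u1, u2 in product((0, 1), repeat=2)
    )


def effect_via_eq11(y, x, z):
    """Right-hand side of Eq. (11): P(y, x | do(z)) / P(x | do(z))."""
    p_x_do_z = sum(joint_do_z(x, yy, z) for yy in (0, 1))
    return joint_do_z(x, y, z) / p_x_do_z
```

In this toy model the ratio agrees with the latent-variable computation at every level of z, which is what "for any level Z = z" asserts.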

Fig. 3(b) demonstrates this graphical criterion. Here Z1 can
serve as an auxiliary variable because (1) there is no directed
path from Z1 to Y in G_{\overline{X}}, and (2) Z2 is a sufficient
set in G_{\overline{Z1}}. The resulting expression for Q becomes:

\[ P(Y = y \mid do(X = x)) = \sum_{z_2} P(y \mid x, do(z_1), z_2)\, P(z_2 \mid x, z_1) \tag{12} \]

The  first  factor  is  estimable  from  the  experimental  dataset 
and  the  second  factor  from  the  observational  dataset  (e.g.,  by 
regression-based  methods). 

Fig. 3(c) and (d) show negative examples in which Q is not
estimable even when both distributions (observational and
experimental) are available; each model violates the necessary
conditions stated above.

Summary Result 1. (Identification in Policy Evaluation) The
analysis of policy evaluation problems has reached a fairly
satisfactory state of maturity. We now possess a complete solution
to the problem of identification, entailing the following:

• both graphical and algorithmic criteria for deciding
identifiability;

• automated procedures for extracting each and every identifiable
estimand;

• extensions that apply to models of any size or shape, including
those invoking sequential dynamic decisions with unmeasured
confounders.

These  results  were  developed  in  several  stages  over  the  past  20 
years  [15,  11,  21,  13,  20]. 

Sample  Selection  Bias 

In this section, we consider the bias associated with the
data-gathering process, as opposed to confounding bias that is
associated with the treatment assignment mechanism. Sample
selection bias (or selection bias for short) is induced by
preferential selection of units for data analysis, usually governed
by unknown factors including treatment, outcome, and their
consequences, and represents a major obstacle to valid statistical
and causal inferences. For instance, in a typical study of the
effect of a training program on earnings, subjects achieving higher
incomes tend to report their earnings more frequently than those
who earn less, resulting in biased inferences.

Fig. 3 (panels a-d). Graphical models illustrating identification
of Q = P(y | do(x)) through the use of experiments over an
auxiliary variable Z. Identifiability follows from
P(x, y | do(Z = z)) in (a), and it also requires P(v) in (b).
Identifiability of Q fails in (c) because Q is not identifiable in
G_{\overline{Z}}, and is also not possible in (d) because there is
a directed path from Z to Y that is not blocked by X.

Selection  bias  challenges  the  validity  of  inferences  in  several 
tasks  in  Artificial  Intelligence  [22,  23]  and  Statistics  [24,  25] 
as  well  as  in  the  empirical  sciences  (e.g.,  Genetics  [26,  27], 
Economics  [28,  29],  and  Epidemiology  [30,  31]). 

To illustrate the nature of preferential selection, consider the
data-generating model in Fig. 4(a) in which X represents a
treatment, Y represents an outcome, and S is a special (indicator)
variable representing entry into the data pool: S = 1 means that
the unit is in the sample, S = 0 otherwise. If our goal is, for
example, to compute the population-level experimental distribution
Q = P(y | do(x)), and the samples available are collected under
preferential selection, only P(y, x | S = 1) is accessible for
use. Under what conditions can Q be recovered from data available
under selection bias?

In the model G in Fig. 4(b), the selection process is
treatment-dependent (i.e., X → S), and the selection mechanism S
is separated from Y by X; hence, P(y | x) = P(y | x, S = 1).
Moreover, given that X and Y are unconfounded, we can rewrite the
l.h.s. as P(y | x) = P(y | do(x)). It follows that the
experimental distribution is recoverable and given by
P(y | do(x)) = P(y | x, S = 1) [32]. On the other hand, if the
selection process is also outcome-dependent (Fig. 4(a)), S is not
separable from Y by X in G, and Q is not recoverable by any method
(without stronger assumptions) [33].
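
The contrast between the two selection mechanisms can be seen in a few lines of arithmetic. This is a minimal sketch with hypothetical probabilities (none of the numbers come from the paper): X and Y are unconfounded, so P(y | do(x)) = P(y | x), and the function computes P(y | x, S = 1) for a selection probability that depends either on X alone or on both X and Y.

```python
# Hypothetical P(Y = 1 | x); X and Y are assumed unconfounded.
P_Y1_GIVEN_X = {0: 0.2, 1: 0.5}


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def p_y_given_x_selected(y, x, p_s1):
    """P(y | x, S = 1), where p_s1(x, y) = P(S = 1 | x, y)."""
    num = bern(P_Y1_GIVEN_X[x], y) * p_s1(x, y)
    den = sum(bern(P_Y1_GIVEN_X[x], yy) * p_s1(x, yy) for yy in (0, 1))
    return num / den


def treatment_only(x, y):
    """Selection depends on treatment only (X -> S)."""
    return 0.9 if x == 1 else 0.3


def treatment_and_outcome(x, y):
    """Selection depends on treatment and outcome (X -> S <- Y)."""
    return 0.2 + 0.3 * x + 0.4 * y


recovered = p_y_given_x_selected(1, 1, treatment_only)
distorted = p_y_given_x_selected(1, 1, treatment_and_outcome)
```

With treatment-only selection the selected-sample conditional coincides with P(y | x); once the outcome also drives selection, the same quantity drifts away from it.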

In practical settings, however, the data-gathering process may be
embedded in more intricate scenarios as shown in Fig. 4(c-f),
where covariates such as age, sex, and socio-economic status also
affect the sampling probabilities. In the model in Fig. 4(c), for
example, W1 (sex) is a driver of the treatment while also
affecting the sampling process. In this case, both confounding and
selection biases need to be controlled for. We can see based on
Def. 3 that {W1, W2}, {W1, W2, Z}, {W1, Z}, {W2, Z}, and {Z} are
all back-door admissible sets, and so proper for controlling
confounding bias. However, only the set {Z} is appropriate for
controlling for selection bias. The reason is that when using the
adjustment formula (Eq. (8)) with any set, say T, the prior
distribution P(t) also needs to be estimable, which is clearly not
feasible for sets other than {Z} (the only set independent of S).
The proper adjustment in this case would be written as

\[ P(y \mid do(x)) = \sum_{z} P(y \mid x, z, S = 1)\, P(z \mid S = 1), \]

where both factors are estimable from the biased dataset.

If we apply the same rationale to Fig. 4(d) and search for a set Z
that is both admissible for adjustment and also available from the
biased dataset, we will fail. In a big data reality, however,
additional datasets with measurements at the population level
(over subsets of the variables) may be available to help compute
these effects. For instance, P(age, sex, race) is usually
estimable from census data without selection bias.

9 These conditions extend the backdoor criterion to allow descendants of X to be part of Z [17].

Definition 4 (below) provides a simple extension of the backdoor
condition which allows us to control both selection and
confounding biases by the following formula [33]:

\[ P(y \mid do(x)) = \sum_{z} P(y \mid x, z, S = 1)\, P(z) \tag{13} \]

where Z is a set of covariates that obeys four conditions.
Conditions (i-ii) assure that Z is backdoor admissible (see
footnote 9), condition (iii) acts to separate the sampling
mechanism S from Y, and condition (iv) guarantees that Z is
measured in both the population-level data and the biased data.

Definition 4. (Selection-backdoor criterion [33]) Let a set Z of
variables be partitioned into Z^+ ∪ Z^- such that Z^+ contains all
non-descendants of X and Z^- the descendants of X. Z is said to
satisfy the selection backdoor criterion (s-backdoor, for short)
relative to an ordered pair of variables (X, Y) and an ordered
pair of sets (M, T) in a graph G_s if Z^+ and Z^- satisfy the
following conditions:

(i) Z^+ blocks all back-door paths from X to Y;

(ii) X and Z^+ block all paths between Z^- and Y, namely,
(Z^- ⊥⊥ Y | X, Z^+);

(iii) X and Z block all paths between S and Y, namely,
(Y ⊥⊥ S | X, Z);

(iv) Z ∪ {X, Y} ⊆ M and Z ⊆ T, where M and T represent
measurements taken in the biased and unbiased studies,
respectively.

To illustrate the use of this criterion, note that any one of the
sets {T1, Z3}, {Z1, Z3}, {Z2, Z3}, {W2, Z3} in Fig. 4(d) satisfies
conditions (i)-(ii) of Def. 4. The first three sets clearly do not
satisfy condition (iii), whereas {W2, Z3} does (since
(Y ⊥⊥ S | {W2, Z3}) in G). If census data are available with
measurements of {W2, Z3}, condition (iv) will be satisfied, and
the experimental distribution P(y | do(x)) is estimable through
the expression

\[ \sum_{w_2, z_3} P(y \mid x, w_2, z_3, S = 1)\, P(w_2, z_3). \]

The first factor can be estimated from the biased dataset and the
second from the population-level (unbiased) dataset.
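
The way Eq. (13) combines the two data sources can be sketched by exact enumeration. The model below is a hypothetical one chosen for illustration, not Fig. 4(d): a single pre-treatment covariate Z affects X, Y, and S, and X also affects Y and S, so Z meets conditions (i)-(iii) of Def. 4; condition (iv) is assumed. All probabilities are made up.

```python
from itertools import product

# Hypothetical model (assumption): Z -> X, Z -> Y, X -> Y, Z -> S, X -> S.
P_Z = {0: 0.7, 1: 0.3}  # population-level P(z), e.g. from census data


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def joint(x, y, z, s):
    """P(x, y, z, s) under the illustrative mechanisms."""
    p_x1 = 0.3 + 0.4 * z
    p_y1 = 0.1 + 0.3 * x + 0.4 * z
    p_s1 = 0.1 + 0.5 * x + 0.3 * z
    return P_Z[z] * bern(p_x1, x) * bern(p_y1, y) * bern(p_s1, s)


def p_y_given_xz_selected(y, x, z):
    """P(y | x, z, S = 1): available from the biased dataset."""
    return joint(x, y, z, 1) / sum(joint(x, yy, z, 1) for yy in (0, 1))


def p_z_selected(z):
    """P(z | S = 1): the covariate distribution among selected units."""
    num = sum(joint(x, y, z, 1) for x, y in product((0, 1), repeat=2))
    den = sum(joint(x, y, zz, 1) for x, y, zz in product((0, 1), repeat=3))
    return num / den


def truth(y, x):
    """P(y | do(x)) by back-door adjustment on the full population."""
    return sum(P_Z[z] * bern(0.1 + 0.3 * x + 0.4 * z, y) for z in (0, 1))


def s_backdoor(y, x):
    """Eq. (13): biased conditional reweighted by the unbiased P(z)."""
    return sum(p_y_given_xz_selected(y, x, z) * P_Z[z] for z in (0, 1))


def biased_adjustment(y, x):
    """Same adjustment, but weighted by P(z | S = 1) instead of P(z)."""
    return sum(p_y_given_xz_selected(y, x, z) * p_z_selected(z) for z in (0, 1))
```

In this toy model `s_backdoor` reproduces the population-level effect, while `biased_adjustment` does not, because Z is not independent of S here and so P(z | S = 1) differs from P(z).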

We note that s-backdoor is a sufficient though not necessary
condition for recoverability. In Fig. 4(e), for example, condition
(i) is never satisfied. Nevertheless, a do-calculus derivation
allows for the estimation of the experimental distribution even
without an unbiased dataset [37], leading to the expression

\[ \frac{\sum_{w_1} P(v \mid S = 1) / P(w_2 \mid w_1, S = 1)}{\sum_{y, w_1} P(v \mid S = 1) / P(w_2 \mid w_1, S = 1)} \]

for any level W2 = w2.

The generalizability of clinical trials. The simple model of
Fig. 4(f) illustrates a common pattern that assists in
generalizing experimental findings from clinical trials. In such
trials, confounding need not be controlled for and the major task
is to generalize from non-representative samples (S = 1) to the
population at large.

This disparity is indeed a major threat to the validity of
randomized trials. Since participation cannot be mandated, we
cannot guarantee that the study population would be the same as
the population of interest. Specifically, the study population may
consist of volunteers, who respond to financial and medical
incentives offered by pharmaceutical firms or experimental teams,
so the distribution of outcomes in the study may differ
substantially from the distribution of outcomes under the policy
of interest.

Fig. 4 (panels a-f). Canonical models where selection is
treatment-dependent in (a,b) and also outcome-dependent in (a).
(c) A more complex model in which {W1, W2} and {Z} are sufficient
for adjustment, but only the latter is adequate for recovering
from selection bias. There is no sufficient set for adjustment
without external data in (d,e,f). (d) Example of an s-backdoor
admissible set. (e,f) Structures with no s-admissible sets that
require more involved recoverability strategies involving
post-treatment variables.

Bearing in mind that we are in a big data context, it is not
unreasonable to assume that both P(y, z | do(x), S = 1) and
P(x, z, y) are available, and the following derivation shows how
the target query in the model of Fig. 4(f) can be transformed to
match these two datasets:

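\[
\begin{aligned}
P(y \mid do(x)) &= \sum_{z} P(y \mid do(x), z)\, P(z \mid do(x)) \\
&= \sum_{z} P(y \mid do(x), z)\, P(z \mid x) \\
&= \sum_{z} P(y \mid do(x), z, S = 1)\, P(z \mid x)
\end{aligned}
\tag{14}
\]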

The  two  factors  in  the  final  expression  are  estimable  from 
the  available  data;  the  first  from  the  trial’s  (biased)  dataset, 
and  the  second  from  the  population  level  dataset. 
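
A numerical check of Eq. (14) may help make the two-dataset combination concrete. The model below is an assumption made for illustration and is not claimed to be Fig. 4(f): a latent U confounds X and Y in the population, X affects a post-treatment variable Z, both X and Z affect Y, and trial participation S depends on Z only. All probabilities are hypothetical.

```python
# Hypothetical model (assumption): U -> X (population only), U -> Y,
# X -> Z, X -> Y, Z -> Y, Z -> S.
P_U = {0: 0.6, 1: 0.4}


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def p_z1(x):
    return 0.3 + 0.4 * x


def p_y1(x, z, u):
    return 0.1 + 0.2 * x + 0.3 * z + 0.3 * u


def truth(y, x):
    """P(y | do(x)) in the whole population, using the latent U."""
    return sum(P_U[u] * bern(p_z1(x), z) * bern(p_y1(x, z, u), y)
               for u in (0, 1) for z in (0, 1))


def trial_joint(x, y, z):
    """P(y, z, S = 1 | do(x)): randomization removes the U -> X arrow."""
    p_s1 = 0.2 + 0.6 * z
    return sum(P_U[u] * bern(p_z1(x), z) * bern(p_y1(x, z, u), y) * p_s1
               for u in (0, 1))


def trial_y_given_z(y, x, z):
    """P(y | do(x), z, S = 1): first factor of Eq. (14), from the trial."""
    return trial_joint(x, y, z) / sum(trial_joint(x, yy, z) for yy in (0, 1))


def population_z_given_x(z, x):
    """P(z | x): second factor of Eq. (14), from observational data."""
    p_x = {u: bern(0.2 + 0.6 * u, x) for u in (0, 1)}  # P(x | u), confounded
    num = sum(P_U[u] * p_x[u] * bern(p_z1(x), z) for u in (0, 1))
    return num / sum(P_U[u] * p_x[u] for u in (0, 1))


def eq14(y, x):
    return sum(trial_y_given_z(y, x, z) * population_z_given_x(z, x)
               for z in (0, 1))


def naive_trial(y, x):
    """P(y | do(x), S = 1): the trial result taken at face value."""
    num = sum(trial_joint(x, y, z) for z in (0, 1))
    return num / sum(trial_joint(x, yy, z) for yy in (0, 1) for z in (0, 1))
```

In this sketch `eq14` matches the population-level effect, whereas `naive_trial` is off because participants over-represent one stratum of Z.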

This example demonstrates the important role that post-treatment
variables (Z) play in facilitating generalizations from clinical
trials. Previous analyses [9, 34, 35] have invariably relied on an
assumption called “S-ignorability,” i.e., Y_x ⊥⊥ S | Z, which
states that the potential outcome Y_x is independent of the
selection mechanism S in every stratum Z = z. When Z satisfies
this assumption, generalizability can be accomplished by
reweighting (or recalibrating) P(z). Recently, however, it was
shown that S-ignorability is rarely satisfied by post-treatment
variables and, even when it is, reweighting will not give the
correct result [36] (see footnote 10).

The derivation of Eq. (14) demonstrates that post-treatment
variables can nevertheless be leveraged for the task, albeit
through non-conventional reweighting formulas, which can be
derived systematically by the do-calculus [37].

Summary Result 2. (Recoverability from Selection Bias)

• The s-backdoor criterion (Def. 4) provides a sufficient
condition for simultaneous recovery from both confounding and
sampling selection bias.

• In clinical trials, causal effects can be recovered from
selection bias through systematic derivations in do-calculus,
leveraging both pre-treatment and post-treatment variables.

• More powerful recoverability methods have been developed for
special classes of models [38, 33, 37].


10 In general, the language of ignorability is too coarse for handling post-treatment variables.

Transportability and the Problem of Data-fusion

In this section, we consider Task 4 (Fig. 1), the problem of
extrapolating experimental findings across domains (i.e.,
settings, populations, environments) that differ both in their
distributions and their inherent causal characteristics. This
problem, called “transportability” in [39], lies at the heart of
every scientific investigation since, invariably, experiments
performed in one environment are intended to be used elsewhere,
where conditions are likely to be different. Special cases of
transportability can be found in the literature under different
rubrics such as “external validity” [40, 41], “heterogeneity”
[42], and “quasi-experiments” [43, Ch. 3], [44]. We formalize the
transportability problem in non-parametric settings and show that
despite glaring differences between the two populations, it might
still be possible to infer causal effects at the target population
by borrowing experimental knowledge from the source populations.

For instance, assume our goal is to infer the causal effect at one
population from experiments conducted in a different population
after noticing that the two age distributions are different. To
illustrate how this task should be formally tackled, consider the
data-generating model in Fig. 5(a) in which X represents a
treatment, Y represents an outcome, Z represents age, and S
(graphically depicted as a square) is a special variable
representing the set of all unaccounted factors (e.g., proximity
to the beach) that create differences in Z (age in this case)
between the source (π) and target (π*) populations. Formally,
conditioning on the event S = s* would mean that we are
considering population π*; otherwise population π is being
considered. This graphical representation is called a “selection
diagram” (see footnote 11).

Our task is then to express the query
Q = P(y | do(x), S = s*) = P*(y | do(x)) in terms of the
experiments conducted in π and the observations collected in π*,
that is, P(y, z | do(x)) and P*(y, x, z). Conditions for
accomplishing this task were derived in [39, 45, 46]. To
illustrate how these conditions work in the model of Fig. 5(a),
note that the target quantity can be rewritten as follows:

\[
\begin{aligned}
Q &= \sum_{z} P(y \mid do(x), S = s^{*}, z)\, P(z \mid S = s^{*}, do(x)) \\
&= \sum_{z} P(y \mid do(x), z)\, P(z \mid S = s^{*}, do(x)) \\
&= \sum_{z} P(y \mid do(x), z)\, P(z \mid S = s^{*}) \\
&= \sum_{z} P(y \mid do(x), z)\, P^{*}(z),
\end{aligned}
\tag{15}
\]

where the first line of the derivation follows after conditioning
on Z, the second line from the independence (S ⊥⊥ Y | Z) in
G_{\overline{X}} (called S-admissibility), the third line from the
third rule of the do-calculus, and the last line from the
definition of the S-node. Eq. (15) is called a transport formula
because it explicates how experimental findings in π are
transported over to π*; the first factor is estimable from π and
the second from π*.
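
As a worked instance of Eq. (15), the sketch below reweights age-specific effects from the source by the target age distribution. The function name `transport` and all numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def transport(z_specific_effect, target_z_dist):
    """Eq. (15): sum over z of P(y | do(x), z) * P*(z)."""
    if not math.isclose(sum(target_z_dist.values()), 1.0):
        raise ValueError("target_z_dist must sum to 1")
    return sum(z_specific_effect[z] * p for z, p in target_z_dist.items())


# Hypothetical inputs: P(y | do(x), z) estimated from the source experiment,
# and the age distributions of the source and target populations.
effect_by_age = {"young": 0.8, "old": 0.4}
source_age = {"young": 0.7, "old": 0.3}
target_age = {"young": 0.2, "old": 0.8}

effect_in_source = transport(effect_by_age, source_age)  # 0.8*0.7 + 0.4*0.3
effect_in_target = transport(effect_by_age, target_age)  # 0.8*0.2 + 0.4*0.8
```

The same age-specific findings yield roughly 0.68 in the source and 0.48 in the target, purely because the weights P*(z) differ.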

Consider Fig. 5(b) where Z now corresponds to “language skills”
(a proxy for the original variable, age, which is unmeasured). A
simple derivation yields a different transport formula [39],
namely

\[ Q = P(y \mid do(x)). \tag{16} \]

In a similar fashion, one can derive a transport formula for
Fig. 5(c) in which Z represents a post-treatment variable (e.g.,
“biomarker”), giving

\[ Q = \sum_{z} P(y \mid do(x), z)\, P^{*}(z \mid x). \tag{17} \]


Fig. 5 (panels a-d). Selection diagrams depicting differences
between source and target populations. In (a), the two populations
differ in age (Z) distributions (so S points to Z). In (b), the
populations differ in how reading skills (Z) depend on age (an
unmeasured variable, represented by the hollow circle) and the age
distributions are the same. In (c), the populations differ in how
Z depends on X. In (d), the unmeasured confounder (bidirected
arrow) between Z and Y precludes transportability.

The transport formula in Eq. (17) states that to estimate the
causal effect of X on Y in the target population π*, we must
estimate the z-specific effect P(y | do(x), z) in π and average it
over z, weighted by the conditional probability P*(z | x)
estimated at π* (instead of the traditional P*(z)). Interestingly,
Fig. 5(d) represents a scenario in which Q is not transportable
regardless of the number of samples collected.

The models in Fig. 5 are special cases of the more general theme
of deciding transportability under any causal diagram. It can be
shown that transportability is feasible if and only if there
exists a sequence of rules that transforms the query expression
Q = P(y | do(x), s*) into a form where the do-operator is
separated from the S-variables [45]. A complete and effective
procedure was devised by [45, 46], which, given any selection
diagram, decides if such a sequence exists and synthesizes a
transport formula whenever possible. Each transport formula
determines what information needs to be extracted from the
experimental and observational studies and how they ought to be
combined to yield an estimate of Q.

Transportability from multiple populations. A generalization of
transportability theory to multiple environments, when limited
experiments are available in each environment, led to a principled
solution to the data-fusion problem. Data-fusion aims to combine
results from many experimental and observational studies, each
conducted on a different population and under a different set of
conditions, so as to synthesize an aggregate measure of targeted
effect size that is “better,” in some sense, than any one study in
isolation. This fusion problem has received enormous attention in
the health and social sciences, and is typically handled by
“averaging out” differences (e.g., using inverse-variance
weighting), which, in general, tends to blur, rather than exploit,
design distinctions among the available studies.
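
For readers unfamiliar with the conventional approach mentioned above, the following sketch shows standard fixed-effect inverse-variance pooling. It illustrates the "averaging out" practice that the text contrasts with, not the fusion method itself; the function name and the study values are hypothetical.

```python
def inverse_variance_pool(estimates, variances):
    """Fixed-effect pooling: weight each study estimate by 1 / variance.

    Assumes independent studies that all estimate one common effect,
    which is exactly the assumption that ignores design differences.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance


# Two hypothetical studies: effect estimates and their sampling variances.
pooled, pooled_variance = inverse_variance_pool([0.30, 0.50], [0.01, 0.04])
```

The pooled value is pulled toward the more precise study regardless of which population or design each study used, which is the sense in which such averaging blurs distinctions among studies.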

Fortunately, using multiple “selection diagrams” to encode
commonalities among studies, [47] “synthesized” an estimator that
is guaranteed to provide an unbiased estimate of the desired
quantity, whenever such an estimate exists. It is based on
information that each study shares with the target environment.
Remarkably, a consistent estimator can be constructed from
multiple sources with limited experiments even in cases where it
is not constructible from any subset of sources considered
separately [48]. We summarize these results as follows:

Summary Result 3. (Transportability and Data-fusion) We now
possess complete solutions to the problem of transportability and
data-fusion, which entail the following:

• Graphical and algorithmic criteria for deciding transportability
and data-fusion in non-parametric models;

• Automated procedures for extracting transport formulae
specifying what needs to be collected in each of the underlying
studies;

• An assurance that, when the algorithm fails, fusion is
infeasible regardless of the sample size.

For detailed discussions of these results, see [39, 46, 48].

11 Each diagram shown in Fig. 5 constitutes indeed the overlapping of the causal diagrams of the
source and target populations. More formally, each variable V_i should be supplemented with an
S-node whenever the underlying (unobserved) function (mechanism) f_i or background factor U_i
is different between π and π*. If knowledge about commonalities and disparities is not available,
transport across domains cannot, of course, be justified.


Conclusion 

The unification of the structural, counterfactual, and graphical
approaches to causal analysis gave rise to mathematical tools that
have helped to resolve a wide variety of causal inference
problems, including the control of confounding, sampling bias, and
cross-population bias. In this paper, we presented a general
approach to these problems, based on a syntactic transformation of
the query of interest into a format derivable from the available
information. Tuned to nuances in design, this approach enables us
to address a crucial problem in big data applications: the need to
combine datasets collected under heterogeneous conditions, so as
to synthesize consistent estimates of causal effects in a target
population. As a by-product of this analysis, we arrived at
solutions to two other long-held problems: recovery from sampling
selection bias and generalization of randomized clinical trials.
We hope that the framework laid out in this paper will stimulate
further research to enhance the arsenal of techniques for drawing
causal inferences from big data.